
Unify SIMD arithmetic under a shared transform_binary template#740

Merged
yungyuc merged 2 commits into solvcon:master from
tigercosmos:issue646
May 6, 2026

Conversation


@tigercosmos tigercosmos commented Apr 19, 2026

Summary

Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add / sub / mul / div. This PR collapses them into a single transform_binary per backend that takes the operation as an injected functor.

What changed

In the generic backend, the four ops now pass std::plus / std::minus / std::multiplies / std::divides into transform_binary. In the NEON backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers around neon_alias, and std::invocable<VecOp, vec_t, vec_t> routes types without a matching vector overload (e.g. int64 for vmulq) to the scalar path at compile time. This replaces the ad-hoc vector_lane > 2 and is_floating_point_v guards.
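The description above implies a shape like the following on the generic side (a minimal sketch; actual modmesh signatures and namespaces may differ):

```cpp
#include <functional>

namespace generic
{

// Element-wise binary transform over [dest, dest_end); the operation
// is injected as a functor so add/sub/mul/div share one loop.
template <typename T, typename ScalarOp>
inline void transform_binary(T * dest, T const * dest_end,
                             T const * src1, T const * src2,
                             ScalarOp scalar_op)
{
    T * ptr = dest;
    while (ptr < dest_end)
    {
        *ptr = scalar_op(*src1, *src2);
        ++ptr;
        ++src1;
        ++src2;
    }
}

// Each public op collapses to a one-line wrapper.
template <typename T>
inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
}

} // namespace generic
```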

Bugs fixed along the way

  • Sub-lane UB in NEON. ptr <= dest_end - N_lane formed a pointer before the buffer when the input was shorter than one SIMD lane. The vector loop now runs a counted trip count (blocks = (dest_end - dest) / N_lane), which is UB-safe on sub-lane inputs and lowers to a subs/b.ne macro-op-fused back-edge on AArch64. The scalar remainder is inline instead of a recursive call into generic::. The same rewrite is applied to check_between.
  • check_between diagnostic. The SIMD body checked the >= max mask first and only looked at < min if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
  • has_vectype typing. Declared size_t; retyped to bool to match its predicate role.
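The counted-trip rewrite from the first bullet can be sketched as below. The vector body is replaced by a scalar stand-in and N_lane is hard-coded to 4 for illustration; the real code loads and stores through vld1q/vst1q and takes the lane width from type::vector_lane&lt;T&gt;:

```cpp
#include <cstddef>

template <typename T>
void transform_binary_counted(T * dest, T const * dest_end,
                              T const * src1, T const * src2)
{
    constexpr std::size_t N_lane = 4; // illustration only

    // Counted trip count: never forms dest_end - N_lane, so a buffer
    // shorter than one lane yields blocks == 0 and no out-of-bounds
    // pointer is ever materialized.
    std::size_t const blocks = static_cast<std::size_t>(dest_end - dest) / N_lane;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        for (std::size_t i = 0; i < N_lane; ++i) // stand-in for the vld1q/vst1q body
        {
            dest[i] = src1[i] + src2[i];
        }
        dest += N_lane;
        src1 += N_lane;
        src2 += N_lane;
    }

    // Inline scalar remainder; no recursive call into generic::.
    while (dest < dest_end)
    {
        *dest = *src1 + *src2;
        ++dest;
        ++src1;
        ++src2;
    }
}
```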

Profiling

Verified on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1):

  • transform_binary inlines fully — no functor calls, no extra moves, no spills. Hot loop is 5 instr/iter with one fused control-chain macro-op (subs/b.ne), matching the hand-written master baseline byte-for-byte in shape.
  • Throughput on cache-resident buffers (n = 256 / 16384) is within ±2% of master across add/sub/mul/div × float/int32/double, well inside run-to-run noise. mul<int64> (scalar fallback) is byte-identical.
  • The same counted-trip form on check_between measures ~1.8× faster than master in an inlined hot scan, because the new loop shape avoids a register-shuffle the post-incrementing form was triggering under inlining pressure.

Tests

tests/test_buffer_simd.py pins _simd_feature() == "NEON" on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n = 1, 3, 4, 5, 8, 17) for transform_binary, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private modmesh._modmesh._simd_feature() binding exposes the runtime-detected backend.

Follow-up

simd::check_between has inconsistent bound semantics across paths: the NEON SIMD body treats value == max_val as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.

Test plan

  • make gtest
  • tests/test_buffer_simd.py on aarch64
  • CI on Linux / macOS / aarch64

🤖 Generated with Claude Code

@tigercosmos tigercosmos marked this pull request as draft April 19, 2026 12:25
@tigercosmos tigercosmos force-pushed the issue646 branch 2 times, most recently from 488d5cd to 11a6eb1 Compare April 19, 2026 13:09
@tigercosmos tigercosmos changed the title Refactor SIMD to xsimd-style loop injection Unify SIMD arithmetic under a shared transform_binary template Apr 27, 2026
@tigercosmos tigercosmos marked this pull request as ready for review April 27, 2026 20:38
@tigercosmos tigercosmos left a comment

@yungyuc The PR is ready for review. Thanks!

// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
Collaborator Author

For checking whether SIMD is working.

Member

cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

Comment on lines +97 to +101
struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
Collaborator Author

Key design in this PR.

constexpr size_t N_lane = type::vector_lane<T>;
if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
{
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Collaborator Author

T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.

{
vec_t v1 = vld1q(src1);
vec_t v2 = vld1q(src2);
vst1q(ptr, vec_op(v1, v2));
Collaborator Author

vec_op is called here.

if constexpr (!type::has_vectype<T>)
{
return generic::add<T>(dest, dest_end, src1, src2);
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Collaborator Author

The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.
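A simplified stand-in for that trait (the real lane counts live in modmesh's detail::vector; only float and int32 are specialized here, so this is illustrative, not the actual table):

```cpp
#include <cstddef>
#include <cstdint>

// Lane count is zero for any type with no vector register form.
template <typename T>
struct vector_info { static constexpr std::size_t N_lane = 0; };
template <>
struct vector_info<float> { static constexpr std::size_t N_lane = 4; };
template <>
struct vector_info<std::int32_t> { static constexpr std::size_t N_lane = 4; };

// The predicate gating the SIMD path; note the bool type (the PR fixes
// an earlier size_t declaration).
template <typename T>
inline constexpr bool has_vectype = vector_info<T>::N_lane > 0;

static_assert(has_vectype<float>);
static_assert(!has_vectype<bool>); // no vector form: scalar path only
```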


#include <cstddef>
#include <arm_neon.h>
#include <cstddef>
Collaborator Author

The formatter fixes the order; I think it should be fine. Let me know if I should revert it.

Member

It's fine.


template <typename T>
inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
Collaborator Author

Fixed the boolean type.

Member

Good catch.

Comment on lines +70 to 73
 inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 + *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }
Collaborator Author

Main design of this PR.

Member

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Collaborator Author

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.

Assembly. The transform_binary<T, ScalarOp, VecOp> template inlines fully — no functor calls, no extra moves, no spills. Hot-loop body for simd_add_f32:

; BEFORE (master)                     ; AFTER (this PR)
ldr     q0, [x2], #16                  ldr     q0, [x2], #16
ldr     q1, [x3], #16                  ldr     q1, [x3], #16
fadd.4s v0, v0, v1                     fadd.4s v0, v0, v1
str     q0, [x0], #16                  str     q0, [x0], #16
cmp     x0, x8                         subs    x8, x8, #1
b.ls    LBB0_1                         b.ne    LBB0_2

5 instr/iter on both sides; one fused control-chain macro-op each (cmp/b.ls before, subs/b.ne after; both are fusion-capable pairs). Same pattern for sub/mul/div/add<int32>/mul<int32>/add<double>. mul<int64> (scalar fallback via std::invocable<VecOp, vec_t, vec_t>) is byte-identical to master.

Throughput (Gelem/s, n=16384, L2-resident, median of 3 runs):

| Op | BEFORE | AFTER | Δ |
| --- | --- | --- | --- |
| add&lt;float&gt; | 14.91 | 14.78 | −0.9% |
| mul&lt;float&gt; | 14.82 | 14.77 | −0.4% |
| div&lt;float&gt; | 14.78 | 14.76 | −0.2% |
| add&lt;int32&gt; | 14.87 | 14.71 | −1.0% |
| add&lt;double&gt; | 5.63 | 5.69 | +1.2% |

All ops within ±2% of master across L1/L2-resident sizes — well inside run-to-run noise. Full write-up + reproducer in profiling/simd_pr740/.

Member

Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.

The profiling results and assembly look good. But clang 17 looks old. The latest version provided by Xcode is version 21.0.0 (clang-2100.0.123.102). An old compiler is OK since both before and after use the same version.

Comment thread tests/test_buffer_simd.py
Comment on lines +45 to +46
if platform.machine() in ("arm64", "aarch64"):
self.assertEqual(feature, "NEON")
Collaborator Author

This checks that NEON is actually in use, which we didn't test before.

Comment thread tests/test_buffer_simd.py
self.skipTest("_simd_feature() = " + feature)


class SimdTransformBinaryTC(unittest.TestCase):
Collaborator Author

Some cases for checking transform_binary functionality.

Collaborator

Why do you isolate this unit test out from test_buffer.py?

Collaborator Author

The whole SIMD implementation also lives outside the buffer directory. I think it's worth a new file.

@yungyuc

yungyuc commented Apr 27, 2026

@KHLee529 Could you please take a look?

@yungyuc yungyuc requested review from KHLee529 and yungyuc April 27, 2026 22:53
@yungyuc yungyuc added performance Profiling, runtime, and memory consumption array Multi-dimensional array implementation labels Apr 27, 2026
@yungyuc yungyuc moved this from Todo to In Progress in tensor operations Apr 27, 2026
@yungyuc yungyuc moved this to In Progress in tabular data processing Apr 27, 2026
@KHLee529
Collaborator

The unified backend looks nice at first glance. I'll dive into the details later.

@yungyuc yungyuc left a comment

  • Clarify if some for loops can also be replaced with while.
  • Run performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
  • Rename test_simd.py to test_buffer_simd.py. We can discuss which name is better.


Comment thread cpp/modmesh/simd/neon/neon.hpp Outdated
// Vector loop runs while a full lane still fits. The remaining-count
// form keeps the condition valid for buffers shorter than one lane.
T const * ptr = start;
while (static_cast<size_t>(end - ptr) >= N_lane)
Member

while reads clearer than for.

Comment thread cpp/modmesh/simd/neon/neon.hpp Outdated
if (ptr != dest_end)

// Tail scalar loop for remaining elements
for (; ptr < end; ++ptr)
Member

Why not using while too?

Collaborator Author

done






Comment thread tests/test_buffer_simd.py
Member

Since most tests are against SimpleArray, I suggest to name the new test file as test_buffer_simd.py?

Collaborator Author

done.

@KHLee529 KHLee529 left a comment

No change requested. Only some comments and questions listed.

struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
Collaborator

Can these operator helper functions also be inlined? Based on my experience profiling the speed of SimpleArray SIMD operations, whether the vector operations are inlined has a large impact on performance.

Collaborator Author

They are implicitly inlined, as confirmed by the profiling.

}
while (ptr < dest_end)
{
*ptr = scalar_op(*src1, *src2);
Collaborator

Nice way to remove the dependency on the generic functions.

{
T idx = *ptr;
if (idx < min_val || idx > max_val)
if (*ptr < min_val || *ptr > max_val)
Collaborator

Is this refinement potentially slower due to one more dereference?

Collaborator Author

No — clang CSEs the two textual *ptr reads into a single load per iteration. Same one ldr in both versions:

  ; BEFORE                              ; AFTER                
  ldr  w10, [x0], #4                    ldr  w10, [x0]         
  cmp  w10, w8                          cmp  w10, w8           
  ccmp w10, w9, #0, ge                  ccmp w10, w9, #0, ge 
  b.le ...                              b.gt ...               
                                        add  x0, x0, #4
                                        cmp  x0, x1            
                                        b.lo ...             

No extra dereference. In fact, when this function inlines into a hot caller the new form is measurably faster (~1.8× on a tight scan loop, M3 Pro): the for (...; ++ptr) shape decouples the load operand from the pointer bump, which gives the register allocator more freedom and avoids a redundant register shuffle that the post-incrementing ldr [x0], #4 triggers under inlining pressure.


Comment on lines +216 to +221
// Vector loop runs while a full lane still fits. Counted trip form
// for the same reason as transform_binary above: avoids UB on
// sub-lane inputs and the per-iter `sub` overhead.
size_t const blocks = static_cast<size_t>(end - start) / N_lane;
T const * ptr = start;
for (size_t block = 0; block < blocks; ++block)
Collaborator Author

An intermediate revision used while (dest_end - ptr >= N_lane) for the bound check. That form is UB-safe but forces clang to emit a non-flag-setting sub + cmp #12 pair, which breaks macro-op fusion on AArch64 and showed a real ~20–25% regression on cache-resident loops. The current head replaces it with a counted trip count (blocks = (dest_end - dest) / N_lane), which restores fusion. That is what the numbers above measure.
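The two shapes can be put side by side in a sketch (hypothetical helper names, N_lane fixed at 4). Both compute the same number of blocks; the difference is only in the condition the back-edge evaluates:

```cpp
#include <cstddef>

// Bound-check form: UB-safe, but `end - ptr >= N_lane` keeps a live
// pointer difference in the loop, which lowers to a non-flag-setting
// sub + cmp pair and defeats macro-op fusion on the back-edge.
template <typename T>
std::size_t blocks_bound_checked(T const * ptr, T const * end)
{
    constexpr std::size_t N_lane = 4;
    std::size_t blocks = 0;
    while (static_cast<std::size_t>(end - ptr) >= N_lane)
    {
        ++blocks;      // vector body would go here
        ptr += N_lane;
    }
    return blocks;
}

// Counted-trip form: one division up front, then a plain counter whose
// back-edge can fuse into subs/b.ne.
template <typename T>
std::size_t blocks_counted_trip(T const * ptr, T const * end)
{
    constexpr std::size_t N_lane = 4;
    std::size_t const blocks = static_cast<std::size_t>(end - ptr) / N_lane;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        ptr += N_lane; // vector body would go here
    }
    return blocks;
}
```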

Member

I did not know that. Good to learn.

Refs solvcon#646 (Task 2). The generic and NEON backends each had four
near-identical loops for add/sub/mul/div. Collapse them into a single
transform_binary per backend that takes the operation as an injected
functor.

In the generic backend, the four ops now pass std::plus / std::minus /
std::multiplies / std::divides into transform_binary. In the NEON
backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers
around neon_alias, and std::invocable<VecOp, vec_t, vec_t> routes
types without a matching vector overload (e.g. int64 for vmulq) to
the scalar path at compile time. This replaces the ad-hoc
vector_lane > 2 and is_floating_point_v guards.

Bugs fixed along the way:

- Sub-lane UB in NEON: `ptr <= dest_end - N_lane` formed a pointer
  before the buffer when the input was shorter than one SIMD lane.
  The vector loop now runs a counted trip count
  (`blocks = (dest_end - dest) / N_lane`), which is UB-safe on
  sub-lane inputs and lowers to a `subs/b.ne` macro-op-fused back-edge
  on AArch64. The scalar remainder is inline instead of a recursive
  call into generic::. The same rewrite is applied to check_between.
- check_between diagnostic: the SIMD body checked the >= max mask
  first and only looked at < min if the first was empty, so a later
  too-large lane could hide an earlier too-small one. Both bounds are
  now inspected before picking the returned pointer.
- has_vectype: declared as size_t; retyped to bool to match its
  predicate role.

tests/test_buffer_simd.py pins _simd_feature() == "NEON" on aarch64
so a silent fallback to the scalar path cannot pass unnoticed. It
covers the int32 shape matrix (n=1, 3, 4, 5, 8, 17) for
transform_binary, the int64-mul SFINAE fallback, and float
sub/mul/div with one block + tail. A new private
modmesh._modmesh._simd_feature() binding exposes the runtime-detected
backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tigercosmos
Collaborator Author

@yungyuc All comments are addressed. Please take a look, thanks!

@yungyuc yungyuc left a comment

LGTM



* Ignore:
  * readability-static-definition-in-anonymous-namespace
  * misc-use-anonymous-namespace
* Add missing const
* Use auto to avoid duplicate type name
@yungyuc yungyuc merged commit 3142aec into solvcon:master May 6, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in tabular data processing May 6, 2026

Labels

array Multi-dimensional array implementation performance Profiling, runtime, and memory consumption

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants